GH-3451. Add a JMH benchmark for variants#3452

Open
steveloughran wants to merge 10 commits into apache:master from steveloughran:pr/benchmark-variant

Conversation

@steveloughran
Contributor

Rationale for this change

There's no benchmark for variant IO, so there's no way to identify problems that exist now, nor any way to detect regressions.

What changes are included in this PR?

  • adds parquet-variant to parquet-benchmark dependencies
  • new JMH benchmark VariantBenchmark

Are these changes tested?

Manually; the initial PR doesn't fork the JVM for each option.

Benchmark                               (depth)  (fieldCount)  Mode  Cnt      Score      Error  Units
VariantBenchmark.benchmarkBuildVariant  Shallow          1000    ss  100   2007.783 ±  244.861  us/op
VariantBenchmark.benchmarkBuildVariant  Shallow         10000    ss  100  16214.358 ± 1048.142  us/op
VariantBenchmark.benchmarkBuildVariant   Nested          1000    ss  100   1544.472 ±   91.232  us/op
VariantBenchmark.benchmarkBuildVariant   Nested         10000    ss  100  15312.341 ±  226.414  us/op
VariantBenchmark.benchmarkDeserialize   Shallow          1000    ss  100    893.913 ±   36.284  us/op
VariantBenchmark.benchmarkDeserialize   Shallow         10000    ss  100   8499.729 ±  197.003  us/op
VariantBenchmark.benchmarkDeserialize    Nested          1000    ss  100    907.712 ±   80.187  us/op
VariantBenchmark.benchmarkDeserialize    Nested         10000    ss  100   8450.447 ±  163.247  us/op
VariantBenchmark.benchmarkSerialize     Shallow          1000    ss  100     16.095 ±   29.358  us/op
VariantBenchmark.benchmarkSerialize     Shallow         10000    ss  100      6.416 ±    7.178  us/op
VariantBenchmark.benchmarkSerialize      Nested          1000    ss  100      3.777 ±    0.528  us/op
VariantBenchmark.benchmarkSerialize      Nested         10000    ss  100      3.956 ±    0.536  us/op
VariantBenchmark.writeShredded          Shallow          1000    ss  100   1943.923 ±  103.121  us/op
VariantBenchmark.writeShredded          Shallow         10000    ss  100  20139.185 ±  341.913  us/op
VariantBenchmark.writeShredded           Nested          1000    ss  100   1920.326 ±   42.812  us/op
VariantBenchmark.writeShredded           Nested         10000    ss  100  20980.458 ±  539.303  us/op
VariantBenchmark.writeUnshredded        Shallow          1000    ss  100     29.876 ±   44.216  us/op
VariantBenchmark.writeUnshredded        Shallow         10000    ss  100     17.380 ±   39.148  us/op
VariantBenchmark.writeUnshredded         Nested          1000    ss  100      3.254 ±    1.061  us/op
VariantBenchmark.writeUnshredded         Nested         10000    ss  100     16.602 ±   33.320  us/op

There are 100 iterations per benchmark because some of the unshredded/small-object operations are so fast that clock granularity becomes an issue.
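As a side note on the granularity issue, here is an illustrative sketch (not code from this PR) of why averaging over many invocations helps: bracketing one sub-microsecond operation with `System.nanoTime()` mostly measures the timer itself, while timing a batch and dividing gives a stable per-op figure. JMH's repeated iterations achieve the same averaging effect.

```java
// Illustrative sketch, not from this PR: timing a batch of invocations and
// dividing by the batch size averages out clock granularity and call overhead.
public class BatchTiming {
    // Mean cost per invocation of op, in nanoseconds, over `batch` runs.
    public static long nanosPerOp(Runnable op, int batch) {
        long start = System.nanoTime();
        for (int i = 0; i < batch; i++) {
            op.run();
        }
        return (System.nanoTime() - start) / batch;
    }

    public static void main(String[] args) {
        // A deliberately tiny operation: one square root per invocation.
        long perOp = nanosPerOp(() -> Math.sqrt(42.0), 1_000_000);
        System.out.println(perOp + " ns/op");
    }
}
```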

Are there any user-facing changes?

No

Closes #3451

@steveloughran steveloughran marked this pull request as draft March 19, 2026 10:45
@steveloughran
Contributor Author

Still thinking of what else can be done here...suggestions welcome.

Probably a real write to the localfs and read back in

@steveloughran
Contributor Author

I'll add a "deep" option too, for consistency with the Iceberg PR.

@steveloughran steveloughran marked this pull request as ready for review March 24, 2026 14:58
private static int count() {
int c = counter++;
if (c >= 512) {
c = 0;

only resets the local copy, counter keeps growing?

Contributor Author

@steveloughran steveloughran Mar 30, 2026


good point. will fix.
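The reviewer's point can be sketched as follows (hypothetical class name, not the PR's final code): resetting only the local copy `c` leaves the static `counter` field growing forever, so after the 512th call every subsequent call would return 0. The field itself has to be reset.

```java
// Hypothetical sketch of the fix, not the PR's final code:
// reset the field itself, not just the local copy.
public class WrappingCounter {
    private static int counter = 0;

    static int count() {
        int c = counter++;
        if (c >= 512) {
            counter = 1; // wrap the field; this call returns 0, the next returns 1
            c = 0;
        }
        return c;
    }
}
```

With the field reset, the returned values cycle through 0..511 indefinitely instead of collapsing to a constant 0 once the counter passes 512.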

* deser to recurse down
* include uuid and bigdecimal
* reset counter on benchmark setup
iterations of class code and #of rows are the same
for easy compare of overheads.
Using the same structure as the iceberg tests do
@steveloughran
Contributor Author

There's now a new benchmark which writes a file using the same simple schema as I'm doing in Iceberg (apache/iceberg#15629), and tries to do a projection on it.

 SELECT id, category, variant_get('nested.varcategory') FROM table

Review by Copilot:


Setup: 1M rows, 4-field nested variant (idstr, varid, varcategory, col4), querying varcategory only. SingleShotTime, 15 iterations, @Fork(0).

Raw Results


  ┌───────────────────────────┬──────────┬───────────────┬─────────┬────────┐
  │ Benchmark                 │ shredded │ Score (ms/op) │ Error   │ µs/row │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ false    │ 728.514       │ ±11.253 │ 0.729  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ false    │ 760.287       │ ±3.314  │ 0.760  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ false    │ 1405.264      │ ±8.399  │ 1.405  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readAllRecords            │ true     │ 1315.615      │ ±14.598 │ 1.316  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedFileSchema   │ true     │ 1297.870      │ ±19.621 │ 1.298  │
  ├───────────────────────────┼──────────┼───────────────┼─────────┼────────┤
  │ readProjectedLeanSchema   │ true     │ 725.618       │ ±10.574 │ 0.726  │
  └───────────────────────────┴──────────┴───────────────┴─────────┴────────┘

Speedup/Penalty vs readAllRecords Baseline


  ┌───────────────────────────┬──────────────────┬──────────────────┐
  │ Benchmark                 │ shredded=false   │ shredded=true    │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedFileSchema   │ −4% (overhead)   │ +1% (noise)      │
  ├───────────────────────────┼──────────────────┼──────────────────┤
  │ readProjectedLeanSchema   │ −93% penalty     │ +45% speedup     │
  └───────────────────────────┴──────────────────┴──────────────────┘
  • Lean schema projection is the only technique that skips columns. Projecting the full file schema (readProjectedFileSchema) gives zero benefit in either case — Parquet still reads all column chunks.
  • Lean schema + shredded = 45% faster than reading all columns. Skipping idstr, varid, and col4 typed columns saves ~590ms per 1M rows.
  • Lean schema + unshredded = 93% slower. The lean schema requests typed_value.varcategory which does not exist in the unshredded file. Parquet handles the missing columns at every row, which is more expensive than
    reading the single binary blob directly.
  • Schema detection in ReadSupport.init() is essential. Applying containsField("typed_value") to choose between lean and full schema prevents the unshredded penalty while preserving the shredded speedup.

Recommendation

Always detect file layout in ReadSupport.init() and apply the lean projection only when the file was written with a shredded schema. For unshredded files, use the full file schema or no projection.

If you have a query with a pushdown predicate that wants to look inside a variant, creating a MessageType schema referring to the shredded values is counterproductive unless you know that the variant is shredded.

That can be determined by looking at the schema and using `containsField("typed_value")` to see if the target variant has any nested values.

    @Override
    public ReadContext init(InitContext context) {
      MessageType fileSchema = context.getFileSchema();
      GroupType nested = fileSchema.getType("nested").asGroupType();
      if (nested.containsField("typed_value")) {
        return new ReadContext(VARCATEGORY_PROJECTION);
      }
      // Unshredded file: projection designed for typed columns provides no benefit and
      // causes schema mismatch overhead — fall back to the full file schema.
      return new ReadContext(fileSchema);
    }

@steveloughran
Contributor Author

Build failures are all because the Java 11 javadoc is fussier than the versions on either side of it.
